# Lab 7 - Linear regression continued

We will continue learning about linear regression by predicting health insurance prices.

First download the dataset from GitHub: [https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv](https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv)

In this data, each row represents an insurance policy and the 7 columns contain the following information about it:
- age: age of policy holder
- sex: sex of policy holder
- bmi: body mass index (BMI) of policy holder. BMI is a (sometimes unreliable) measurement of body fat in adults
- children: number of children (dependents) on the policy
- smoker: whether the policy holder is a smoker
- region: region of the country the policy holder lives in
- charges: price for insurance policy

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import seaborn as sns

%matplotlib inline

Load the CSV file into a dataframe and display it:
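A minimal sketch of the loading step. So that it runs without the download, it reads two made-up rows (not real records) laid out in the same seven columns listed above. The commented lines at the end assume the file was saved as `insurance.csv` in the notebook's working directory; that filename and location are assumptions.

```python
import io
import pandas as pd

# Two made-up rows (NOT real records) in the same seven-column layout as the dataset.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,0,no,southwest,4000.0\n"
    "45,male,31.5,2,yes,northeast,30000.0\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # which columns were read as numbers and which as text

# With the real download, assuming it sits in the working directory:
# insurance = pd.read_csv("insurance.csv")
# insurance.head()
```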

## Exploratory Data Analysis

To get a feel for the data, let's do some quick exploratory data analysis.

What's the histogram of the bmi column?

What distribution does the bmi data have?

Plot scatter plots of all pairs of quantitative variables (hint: use the Seaborn function `pairplot` from Lab 1 to plot them all at once)

Do any of the variables have a linear relationship with the charges?

Use Seaborn to make a scatter plot with bmi on the x axis, charges on the y axis, and colored by whether the person is a smoker or not (see Lab 3).
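As a reminder of the coloring technique, here is a small sketch on made-up points. The column names `x`, `y`, and `group` are placeholders, not columns of the insurance data; for the lab you would substitute the relevant columns of your own dataframe.

```python
import pandas as pd
import seaborn as sns

# Made-up points; "x", "y", and "group" are placeholder column names.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 3, 5, 12, 14, 15],
    "group": ["no", "no", "no", "yes", "yes", "yes"],
})

# hue= assigns one color per category found in the "group" column.
ax = sns.scatterplot(data=toy, x="x", y="y", hue="group")
ax.set_title("Toy scatter plot colored by group")
```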

Which appears to have the larger effect on the charge: the policy holder's bmi or whether they are a smoker?

Next use Seaborn to make a scatter plot with age on the x axis and charges on the y axis, colored by whether the person is a smoker.

What do you notice about the plot?

## Linear regression

Perform linear regression to predict the insurance charge, with age as the independent variable.

What is the equation for the linear model?

How much does this model predict your insurance will increase next year when you are 1 year older?
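To see how the fitted coefficients turn into an equation, here is a sketch on made-up data that follows y = 100 + 20x exactly. The names `toy` and `toy_lm` are illustrative, and the numbers have nothing to do with the insurance data. It also shows that predictions one unit apart differ by the slope.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data that follows y = 100 + 20 * x exactly, so the fit is easy to check.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = 100 + 20 * toy["x"]

toy_lm = smf.ols("y ~ x", data=toy).fit()

# params is indexed by term name: "Intercept" plus one entry per predictor.
intercept = toy_lm.params["Intercept"]
slope = toy_lm.params["x"]
print(f"y = {intercept:.2f} + {slope:.2f} * x")

# Predictions for inputs one unit apart differ by the slope.
preds = toy_lm.predict(pd.DataFrame({"x": [10, 11]}))
increase = preds.iloc[1] - preds.iloc[0]
print(increase)
```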

How much would a 25 year old pay? We can predict this using our model. First we make a new DataFrame with the age we want to make the prediction for.

In [None]:
new_data = pd.DataFrame({'age': [25]})
new_data

Then we make the prediction:

In [None]:
lm.predict(new_data)

What about if you are 30, 40, or 50 years old? We can compute the predicted charges of all of these ages at once by making a data frame containing all three ages: 

In [None]:
three_ages = pd.DataFrame({'age': [30,40,50]})
three_ages

Make the prediction using this new dataframe:

Use the Seaborn package to plot a scatter plot of age vs charges with the regression line on it (see Lab 6):
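One way to draw a scatter plot with a fitted line is Seaborn's `regplot`, sketched here on made-up points. Lab 6 may have used a different function, such as `lmplot`, so treat this as one option rather than the expected answer.

```python
import pandas as pd
import seaborn as sns

# Made-up, roughly linear points; "x" and "y" are placeholder column names.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
})

# regplot draws the points and the least-squares line; ci=None skips the confidence band.
ax = sns.regplot(data=toy, x="x", y="y", ci=None)
ax.set_title("Toy scatter plot with regression line")
```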

What do you notice about the scatter plot?

Let's see if this shows up in the plots of the residuals. Plot the histogram of the residuals.

Does this look like a normal distribution?

Let's also plot the fitted values (y) against the actual charges (x):

Alternatively, we could plot the ages (x) vs. the residuals (y):
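The three diagnostic plots described above can be sketched as follows, using made-up data rather than the insurance data. A fitted statsmodels model exposes its residuals as `.resid` and its fitted values as `.fittedvalues`; the names `toy` and `toy_lm` are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Made-up data: a straight line plus normally distributed noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": np.linspace(0, 10, 50)})
toy["y"] = 3 + 2 * toy["x"] + rng.normal(0, 1, size=50)

toy_lm = smf.ols("y ~ x", data=toy).fit()
residuals = toy_lm.resid        # actual value minus fitted value, one per row
fitted = toy_lm.fittedvalues    # the model's prediction for each row

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(residuals, bins=10)
axes[0].set_title("Histogram of residuals")

axes[1].scatter(toy["y"], fitted)
axes[1].set_xlabel("actual y")
axes[1].set_ylabel("fitted y")

axes[2].scatter(toy["x"], residuals)
axes[2].axhline(0, color="gray")  # residuals should scatter evenly around 0
axes[2].set_xlabel("x")
axes[2].set_ylabel("residual")

fig.tight_layout()
```

If the linear model is a good description of the data, the histogram looks roughly bell-shaped and the last plot shows no visible pattern.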

Clearly age does not provide the whole picture. In fact, the R-squared value in the summary (top right corner) is the proportion of variance in the charges that is explained by this model. Right now this is about 9%, which is low.

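However, we see that the p-value for age (the P>|t| column in the summary) is very close to 0. The p-value is the probability of seeing an estimate at least this far from 0 if the true coefficient were 0. A p-value this small is strong evidence that the coefficient is not 0, so there is a linear effect of age on insurance prices.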
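Both numbers can also be read directly from the fitted model instead of from the summary table. The sketch below uses made-up noisy data (not the insurance data) and checks by hand that R-squared equals one minus the ratio of residual variation to total variation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: a weak linear trend with a lot of noise.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": np.linspace(0, 10, 100)})
toy["y"] = 1 + 0.5 * toy["x"] + rng.normal(0, 3, size=100)

toy_lm = smf.ols("y ~ x", data=toy).fit()

print(toy_lm.rsquared)  # same number as "R-squared" in the summary
print(toy_lm.pvalues)   # same numbers as the P>|t| column, indexed by term name

# R-squared by hand: 1 - (residual sum of squares) / (total sum of squares).
ss_res = (toy_lm.resid ** 2).sum()
ss_tot = ((toy["y"] - toy["y"].mean()) ** 2).sum()
r2_by_hand = 1 - ss_res / ss_tot
print(r2_by_hand)
```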

Let's add the other quantitative columns as independent variables to see if we can get a better fit.

In [None]:
lm2 = smf.ols('charges ~ age + bmi + children', data = insurance).fit()
lm2.summary()

What is the equation of this linear model?

Has the R-squared value improved?

Looking at the p-values, could any of the coefficients be 0? 

Now let's plot the residuals. First, plot a histogram of the residuals:

What do you notice? Are the residuals normal?

Next, let's plot the actual charges (x) vs the predicted charges (y):

Did adding bmi and children improve the linear model?

Let's add the remaining columns. Sex, smoker, and region are all categorical variables. But there is a way to make them into quantitative data using *dummy variables*.

For example, consider the sex column. There are two categories in it: female and male. We will encode this using one dummy variable that will be 1 if the sex is male and 0 if the sex is female.

In [None]:
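# drop_first=True keeps a single column (sex_male); dtype=int shows it as 1/0 rather than True/False
insurance_new = pd.get_dummies(insurance, columns=["sex"], drop_first=True, dtype=int)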
insurance_new.head()

Let's make the other qualitative columns into dummy variables:

In [None]:
# Encode the remaining categorical columns the same way, again as 1/0 integers
insurance_new = pd.get_dummies(insurance_new, columns=["smoker", "region"], drop_first=True, dtype=int)
insurance_new.head()

How was the region column, which had 4 categories, turned into dummy variables?
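To explore this question on something smaller, the sketch below encodes a made-up column with three categories. The `size` and `price` columns are hypothetical and unrelated to the insurance data. With `drop_first=True`, a column with k categories becomes k - 1 dummy columns, and the dropped category is the one whose rows are 0 in all of them.

```python
import pandas as pd

# Made-up data with a three-category column.
toy = pd.DataFrame({
    "size": ["small", "medium", "large", "medium"],
    "price": [1.0, 2.0, 3.0, 2.5],
})

# Categories are sorted alphabetically (large, medium, small) and the first is dropped.
dummies = pd.get_dummies(toy, columns=["size"], drop_first=True, dtype=int)
print(dummies)

# The "large" row (index 2) is the baseline: both dummy columns are 0 there.
baseline_row = dummies.loc[2, ["size_medium", "size_small"]]
print(baseline_row)
```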

Now let's make a linear regression model using all these columns:
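With many columns, typing the formula by hand gets tedious. One option, sketched here on made-up data, is to build the formula string from the dataframe's column names. The names `toy`, `toy_new`, and `toy_lm` are illustrative, and this is only one way to do it.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data following y = 1 + 2 * x + 5 * (group is "b") exactly.
toy = pd.DataFrame({
    "x": [0, 1, 2, 3, 4, 5],
    "group": ["a", "b", "a", "b", "a", "b"],
})
toy["y"] = 1 + 2 * toy["x"] + 5 * (toy["group"] == "b")

toy_new = pd.get_dummies(toy, columns=["group"], drop_first=True, dtype=int)

# Use every column except the response as a predictor.
predictors = [c for c in toy_new.columns if c != "y"]
formula = "y ~ " + " + ".join(predictors)
print(formula)

toy_lm = smf.ols(formula, data=toy_new).fit()
print(toy_lm.params)
```

The coefficient on a dummy variable is the estimated difference between that category and the dropped baseline category, holding the other predictors fixed.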

Has the R-squared value improved?

Looking at the p-values, could any of the coefficients be 0? Next class we will learn how to decide which independent variables to include in your linear model.

Now let's plot the residuals. First, plot a histogram of the residuals:

What do you notice? Are the residuals normal?

Next, let's plot the actual charges (x) vs the predicted charges (y):

What do you notice? Has the model improved?